Single Cell Application Pt. 2: Cell-cell communication¶

Instructors:
Zongxu Zhang (zongxu.zhang@stu.pku.edu.cn)
Zexian Zeng (zexianzeng@pku.edu.cn)

Affiliation:
Center for Quantitative Biology
Academy for Advanced Interdisciplinary Studies
Peking University

Lab Website:
🔗 http://cqb.pku.edu.cn/zenglab/

This materials is largely based on CellphoneDB, we thank the origional authors for their contribution.

Review of Last Week’s Topics¶

Dynamo
A python packages for RNA velocity analysis, supplemented with differential gemotary analysis to reveal gene regulation process involved in differential process.

💡 Note: This course will primarily focus on Python-based workflows for scRNA-seq data analysis.

Introduction to cell cell communications¶

Cell communication is a process by which cells react to stimuli from their environment and also from themselves.

Autocrine signalling refers to intracellular communication whereby cells secrete ligands that are used to induce a cellular response through cognate receptors for those molecules expressed on the same cell.

Paracrine cell–cell communication does not require cell–cell contact, rather depending on the diffusion of signalling molecules from one cell to another after secretion.

Juxtacrine, that is, contact-dependent, cell–cell communication relies on gap junctions or other structures such as membrane nanotubes to pass signalling molecules directly between cells, without secretion into the extracellular space. Endocrine cell–cell communication represents intercellular communication whereby signalling molecules are secreted and travel long distances through extracellular fluids such as the blood plasma; typical mediators of this communication are hormones.

Cell communication is a process by which cells react to stimuli from their environment and also from themselves. In multicellular organisms, the dynamic coordination of cells, also called cell-cell communication (CCC), is involved in many biological processes, such as apoptosis and cell migration, and is consequently essential in homeostasis and disease.

CCC commonly focuses on protein-mediated interactions, most typically perceived as a secreted ligand binding to its corresponding plasma membrane receptor. However, this picture can be broadened to include secreted enzymes, extra-cellular matrix proteins, transporters, and interactions that require the physical contact between cells, such as cell-cell adhesion proteins and gap junctions.

Cell communication is further not independent of other processes, but the contrary, as external stimuli commonly elicit a downstream response. In the case of CCC, this is typically perceived as the induction of canonical pathways and downstream transcription factors in the cells receiving the signal, or receiver cells. Ultimately these external stimuli alter the function of receiver cells, and this alteration is further propagated via the subsequent interaction of these cells with their microenvironment.

Traditionally, the study of CCC required specialized in-situ biochemical assays, such proximity labelling proteomics, co-immunoprecipitation, and yeast two-hybrid screening. Yet, the rapid developments and dropping costs of transcriptomics data generation has enabled a paradigm shift away from focusing on which types of cells are present, but rather on the relationships between them. As a consequence, CCC inference from single-cell data is now becoming a routine approach, capable of providing a system-level hypotheses of intercellular crosstalk in vivo.

Typically, inference of Cell-cell communication from single cell RNA seq data involoves 6 steps:

Step 1 : Collection of samples or cells for transcriptomic analysis

Step 2 : Data preprocessing and generation of expression matrix

Step 3 : Collection of interacting proteins or ligand-receptors pairs

Step 4 : Filtering by interacting proteins or ligand-receptor pairs

Step 5 : Cell-cell interaction and communication analysis

Scoring function for cell cell communication

Step 6 : Results interpretation and visuzalization

A simple numerical cases:

Two primary inputs are used for quantifying communication scores: a preprocessed gene expression matrix (part a) and a list of interacting proteins to supervise the analysis (for example, ligand–receptor pairs) (part b). Then a communication score (CS) can be computed for every ligand–receptor pair in a given pair of cells. Here, we show how to perform these calculations for four core functions (parts c–f). These are applied to elucidate paracrine (parts c,d) and autocrine (parts e,f) communication. To assess cell–cell communication, a CS can be computed for each ligand–receptor pair by accounting for the presence of both partners if their expression is greater than a given threshold, which for demonstrative purposes was set arbitrarily to a value of 3.3 (part c), or by multiplying their expression values (part d). Similarly, the CS for each ligand–receptor pair can be the correlation score obtained from their expression across all cell types for autocrine communication (part e). To reveal non-autocrine interactions, the correlation can be computed across pairs of different cells. Particular signatures of each cell type can be extracted through analysing differentially expressed ligands and receptors. Using the cell type-specific differentially expressed genes, we can assign a binary CS and study the ligand–receptors used for autocrine communication (part f). In this example, autocrine communication is evaluated for cell type A by using its differentially expressed genes with respect to cell type B (cell type A-specific genes are located in the coloured quadrant). Analogously to the correlation score, for non-autocrine communication we would need to consider differentially expressed genes in each of the cell types or samples. For a given pair of cells, we can say that a communication pathway is active when the ligand is differentially expressed in one cell and its cognate receptor is differentially expressed in the other. FC, fold change.

Belows, we introduce cellphoneDB, the basic idea is:

To assess cellular crosstalk between different cell types, we used our repository in a statistical framework for inferring cell–cell communication networks from single-cell transcriptome data. We derived enriched receptor–ligand interactions between two cell types based on expression of a receptor by one cell type and a ligand by another cell type, using the droplet-based data. To identify the most relevant interactions between cell types, we looked for the cell-type specific interactions between ligands and receptors. Only receptors and ligands expressed in more than 10% of the cells in the specific cluster were considered.

We performed pairwise comparisons between all cell types. First, we randomly permuted the cluster labels of all cells 1,000 times and determined the mean of the average receptor expression level of a cluster and the average ligand expression level of the interacting cluster. For each receptor–ligand pair in each pairwise comparison between two cell types, this generated a null distribution. By calculating the proportion of the means which are ‘as or more extreme’ than the actual mean, we obtained a P value for the likelihood of cell-type specificity of a given receptor–ligand complex. We then prioritized interactions that are highly enriched between cell types based on the number of significant pairs, and manually selected biologically relevant ones. For the multi-subunit heteromeric complexes, we required that all subunits of the complex are expressed (using a threshold of 10%), and therefore we used the member of the complex with the minimum average expression to perform the random shuffling.

Citation:

  1. cellphoneDBv1

  2. The latest version: cellphoneDBv5

1. Download the cellphoneDB database¶

In [ ]:
### better run in a conda enviroments
! pip install cellphonedb
In [1]:
import pandas as pd
import glob
import os

from IPython.display import HTML, display
from cellphonedb.utils import db_releases_utils
In [2]:
# Direct show figure in jupyter
%matplotlib inline
In [3]:
# Check database version
display(HTML(db_releases_utils.get_remote_database_versions_html()['db_releases_html_table']))
VersionRelease date
v5.0.02023-10-31
v4.1.02023-03-09
In [4]:
# -- Version of the databse
cpdb_version = 'v5.0.0'

# -- Path where the input files to generate the database are located
cpdb_target_dir = os.path.join('./data/cellphoneDB', cpdb_version)
In [5]:
# donwload the database
from cellphonedb.utils import db_utils

db_utils.download_database(cpdb_target_dir, cpdb_version)
Downloaded cellphonedb.zip into ./data/cellphoneDB/v5.0.0
Downloaded complex_input.csv into ./data/cellphoneDB/v5.0.0
Downloaded gene_input.csv into ./data/cellphoneDB/v5.0.0
Downloaded interaction_input.csv into ./data/cellphoneDB/v5.0.0
Downloaded protein_input.csv into ./data/cellphoneDB/v5.0.0
Downloaded uniprot_synonyms.tsv into ./data/cellphoneDB/v5.0.0/sources
Downloaded transcription_factor_input.csv into ./data/cellphoneDB/v5.0.0/sources

2. Basic analysis: Score for interaction¶

Input files¶

The statistial method accepts 4 input files (3 mandatory).

  • cpdb_file_path: (mandatory) path to the database cellphonedb.zip.
  • meta_file_path: (mandatory) path to the meta file linking cell barcodes to cluster labels metadata.tsv.
  • counts_file_path: (mandatory) paths to normalized counts file (not z-transformed), either in text format or h5ad (recommended) normalised_log_counts.h5ad.
  • microenvs_file_path (optional) path to microenvironment file that groups cell types/clusters by microenvironments. When providing a microenvironment file, CellphoneDB will restrict the interactions to those cells within the microenvironment.

The microenvs_file_path content will depend on the biological question that the researcher wants to answer.

In this example we are studying how cell-cell interactions change between a subset of immune cells and trophoblast cells as the trophoblast differentiate and invade the maternal uterus. This module will randomly permute the cluster labels of all cells whitin each microenvironement (microenvs_file_path) 1,000 times (default) and determine the mean of the average receptor expression level in a cluster and the average ligand expression level in the interacting cluster. Then, we will obtain a P-value for the likelihood of cell-type specificity of a given receptor–ligand complex.

In [6]:
# set paths
cpdb_file_path = 'data/cellphoneDB/v5.0.0/cellphonedb.zip'
meta_file_path = 'data/cellphoneDB/metadata.tsv'
counts_file_path = 'data/cellphoneDB/normalised_log_counts.h5ad'
microenvs_file_path = 'data/cellphoneDB/microenvironment.tsv'
out_path = 'results/cellphoneDB/score/'

Inspect input files¶

1) The metadata file is compossed of two columns:

  • barcode_sample: this column indicates the barcode of each cell in the experiment.
  • cell_type: this column denotes the cell label assigned.
In [7]:
metadata = pd.read_csv(meta_file_path, sep = '\t')
metadata.head(3)
Out[7]:
barcode_sample cell_type
0 AGCGATTAGTCTAACC-1_Pla_HDBR10917733 B_cells
1 ATCCGTGAGGCTAGAA-1_Pla_Camb10714918 B_cells
2 AGTAACCCATTAAAGG-1_Pla_HDBR10917733 B_cells

2) The counts files is a scanpy h5ad object. The dimensions and order of this object must coincide with the dimensions of the metadata file (i.e. must have the same number of cells in both files).

In [8]:
import anndata

adata = anndata.read_h5ad(counts_file_path)
adata.shape
Out[8]:
(3312, 30800)

Check barcodes in metadata and counts are the same.

In [9]:
list(adata.obs.index).sort() == list(metadata['barcode_sample']).sort()
Out[9]:
True

3) Micronevironments defines the cell types that belong to a a given microenvironment. CellphoneDB will only calculate interactions between cells that belong to a given microenvironment. In this file we are defining just one microenvionment.

In [10]:
microenv = pd.read_csv(microenvs_file_path,
                       sep = '\t')
microenv.head(3)
Out[10]:
cell_type microenvironment
0 PV MMP11 Env1
1 PV MYH11 Env1
2 PV STEAP4 Env1
In [11]:
microenv.groupby('microenvironment', group_keys = False)['cell_type'] \
    .apply(lambda x : list(x.value_counts().index))
Out[11]:
microenvironment
Env1    [PV MMP11, PV MYH11, PV STEAP4, EVT_1, EVT_2, ...
Name: cell_type, dtype: object

Scoring the interaction¶

The output of this method will be saved in output_path and also returned to the predefined variables.

In [12]:
from cellphonedb.src.core.methods import cpdb_analysis_method

cpdb_results = cpdb_analysis_method.call(
    cpdb_file_path = cpdb_file_path,           # mandatory: CellphoneDB database zip file.
    meta_file_path = meta_file_path,           # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = counts_file_path,       # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object
    counts_data = 'hgnc_symbol',               # defines the gene annotation in counts matrix.
    microenvs_file_path = microenvs_file_path, # optional (default: None): defines cells per microenvironment.
    score_interactions = True,                 # optional: whether to score interactions or not. 
    output_path = out_path,                    # Path to save results    microenvs_file_path = None,
    separator = '|',                           # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
    threads = 5,                               # number of threads to use in the analysis.
    threshold = 0.1,                           # defines the min % of cells expressing a gene for this to be employed in the analysis.
    result_precision = 3,                      # Sets the rounding for the mean values in significan_means.
    debug = False,                             # Saves all intermediate tables emplyed during the analysis in pkl format.
    output_suffix = None                       # Replaces the timestamp in the output files by a user defined string in the  (default: None)
)
[ ][CORE][10/11/25-12:11:17][INFO] [Non Statistical Method] Threshold:0.1 Precision:3
Reading user files...
The following user files were loaded successfully:
data/cellphoneDB/normalised_log_counts.h5ad
data/cellphoneDB/metadata.tsv
data/cellphoneDB/microenvironment.tsv
[ ][CORE][10/11/25-12:11:20][INFO] Running Basic Analysis
[ ][CORE][10/11/25-12:11:20][INFO] Limiting cluster combinations using microenvironments
[ ][CORE][10/11/25-12:11:20][INFO] Building results
[ ][CORE][10/11/25-12:11:21][INFO] Scoring interactions: Filtering genes per cell type..
100%|██████████| 42/42 [00:00<00:00, 330.13it/s]
[ ][CORE][10/11/25-12:11:21][INFO] Scoring interactions: Calculating mean expression of each gene per group/cell type..
100%|██████████| 42/42 [00:00<00:00, 1030.37it/s]
/home/zongxu/miniconda3/envs/spatialCC/lib/python3.11/site-packages/cellphonedb/utils/scoring_utils.py:138: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  matrix[index_name].replace(to_replace=id2name, inplace=True)
[ ][CORE][10/11/25-12:11:22][INFO] Scoring interactions: Calculating scores for all interactions and cell types..
100%|██████████| 81/81 [00:17<00:00,  4.70it/s]
Saved means_result to results/cellphoneDB/score/simple_analysis_means_result_11_10_2025_121139.txt
Saved deconvoluted to results/cellphoneDB/score/simple_analysis_deconvoluted_11_10_2025_121139.txt
Saved deconvoluted_percents to results/cellphoneDB/score/simple_analysis_deconvoluted_percents_11_10_2025_121139.txt
Saved interaction_scores to results/cellphoneDB/score/simple_analysis_interaction_scores_11_10_2025_121139.txt
In [13]:
print(cpdb_results.keys())
dict_keys(['means_result', 'deconvoluted', 'deconvoluted_percents', 'interaction_scores'])

Description of output files¶

Most output files share common columns:

  • id_cp_interaction: Unique CellphoneDB identifier for each interaction stored in the database.
  • interacting_pair: Name of the interacting pairs separated by “|”.
  • partner A or B: Identifier for the first interacting partner (A) or the second (B). It could be: UniProt (prefix simple:) or complex (prefix complex:)
  • gene A or B: Gene identifier for the first interacting partner (A) or the second (B). The identifier will depend on the input user list.
  • secreted: True if one of the partners is secreted.
  • Receptor A or B: True if the first interacting partner (A) or the second (B) is annotated as a receptor in our database.
  • annotation_strategy: Curated if the interaction was annotated by the CellphoneDB developers. Otherwise, the name of the database where the interaction has been downloaded from.
  • is_integrin: True if one of the partners is integrin.
  • directionality: Indiicates the directionality of the interaction and the charactersitics of the interactors.
  • classification: Pathway classification for the interacting partners.

Means fields:

  • means: Mean values for all the interacting partners: mean value refers to the total mean of the individual partner average expression values in the corresponding interacting pairs of cell types. If one of the mean values is 0, then the total mean is set to 0.
In [14]:
cpdb_results['means_result'].head(2)
Out[14]:
id_cp_interaction interacting_pair partner_a partner_b gene_a gene_b secreted receptor_a receptor_b annotation_strategy ... GC|eEVT eEVT|PV MMP11 eEVT|PV MYH11 eEVT|PV STEAP4 eEVT|EVT_1 eEVT|EVT_2 eEVT|iEVT eEVT|VCT_CCC eEVT|GC eEVT|eEVT
0 CPI-SC0A2DB962D CDH1_integrin_a2b1_complex simple:P12830 complex:integrin_a2b1_complex CDH1 NaN False False False curated ... 0.433 0.232 0.0 0.0 0.198 0.225 0.215 0.203 0.225 0.472
1 CPI-SC0B5CEA47D COL10A1_integrin_a2b1_complex simple:Q03692 complex:integrin_a2b1_complex COL10A1 NaN True False False curated ... 0.000 0.049 0.0 0.0 0.015 0.043 0.032 0.020 0.042 0.289

2 rows × 94 columns

In [15]:
cpdb_results['interaction_scores'].head(2)
Out[15]:
id_cp_interaction interacting_pair partner_a partner_b gene_a gene_b secreted receptor_a receptor_b annotation_strategy ... GC|eEVT eEVT|PV MMP11 eEVT|PV MYH11 eEVT|PV STEAP4 eEVT|EVT_1 eEVT|EVT_2 eEVT|iEVT eEVT|VCT_CCC eEVT|GC eEVT|eEVT
0 CPI-SC0A2DB962D CDH1_integrin_a2b1_complex simple:P12830 complex:integrin_a2b1_complex CDH1 NaN False False False curated ... 41.35 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 51.391
1 CPI-SC0B5CEA47D COL10A1_integrin_a2b1_complex simple:Q03692 complex:integrin_a2b1_complex COL10A1 NaN True False False curated ... 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000

2 rows × 94 columns

Deconvoluted fields:

  • gene_name: Gene identifier for one of the subunits that are participating in the interaction defined in “means.csv” file. The identifier will depend on the input of the user list.
  • uniprot: UniProt identifier for one of the subunits that are participating in the interaction defined in “means.csv” file.
  • is_complex: True if the subunit is part of a complex. Single if it is not, complex if it is.
  • protein_name: Protein name for one of the subunits that are participating in the interaction defined in “means.csv” file.
  • complex_name: Complex name if the subunit is part of a complex. Empty if not.
  • id_cp_interaction: Unique CellphoneDB identifier for each of the interactions stored in the database.
  • mean: Mean expression of the corresponding gene in each cluster.
In [16]:
cpdb_results['deconvoluted'].head(2)
Out[16]:
gene_name uniprot is_complex protein_name complex_name id_cp_interaction gene B_cells DC EVT_1 ... dS1 dS2 dS3 dT_cells dT_regs eEVT fF1 fF2 iEVT uSMC
multidata_id
1355 UBASH3B Q8TF42 True UBS3B_HUMAN Dehydroepiandrosterone_bySTS CPI-CS09B8977D7 UBASH3B 0.24 0.0 0.621 ... 0.118 0.224 0.162 0.025 0.124 1.182 0.055 0.0 1.402 0.338
1355 UBASH3B Q8TF42 True UBS3B_HUMAN Dehydroepiandrosterone_bySTS CPI-CS05760BB78 UBASH3B 0.24 0.0 0.621 ... 0.118 0.224 0.162 0.025 0.124 1.182 0.055 0.0 1.402 0.338

2 rows × 49 columns

Deconvoluted percents fields:

  • percents: Percent of cells (clusters) expressing the given gene.
In [17]:
cpdb_results['deconvoluted_percents'].head(2)
Out[17]:
gene_name uniprot is_complex protein_name complex_name id_cp_interaction gene B_cells DC EVT_1 ... dS1 dS2 dS3 dT_cells dT_regs eEVT fF1 fF2 iEVT uSMC
multidata_id
1355 UBASH3B Q8TF42 True UBS3B_HUMAN Dehydroepiandrosterone_bySTS CPI-CS09B8977D7 UBASH3B 0.125 0.0 0.698 ... 0.123 0.24 0.354 0.021 0.053 0.857 0.048 0.0 0.892 0.359
1355 UBASH3B Q8TF42 True UBS3B_HUMAN Dehydroepiandrosterone_bySTS CPI-CS05760BB78 UBASH3B 0.125 0.0 0.698 ... 0.123 0.24 0.354 0.021 0.053 0.857 0.048 0.0 0.892 0.359

2 rows × 49 columns

3. Run statistical analysis: extract significant interaction pairs¶

The output of this method will be saved in output_path and also returned to the predefined variables.

The statisical method allows the user to downsample the data with the aim of speeding up the results (subsampling arguments). To this end, CellphoneDB employs a geometric sketching procedure (Hie et al. 2019) to preserve the structure of the data without losing information from lowly represented cells. For this tutorial, we have opted to manually downsample the count matrix and the metadata file accordingly.

In [18]:
# set path
cpdb_file_path = 'data/cellphoneDB/v5.0.0/cellphonedb.zip'
meta_file_path = 'data/cellphoneDB/metadata.tsv'
counts_file_path = 'data/cellphoneDB/normalised_log_counts.h5ad'
microenvs_file_path = 'data/cellphoneDB/microenvironment.tsv'
active_tf_path = 'data/cellphoneDB/active_TFs.tsv'
out_path = 'results/cellphoneDB/sig/'
In [19]:
from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

cpdb_results = cpdb_statistical_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                 # mandatory: CellphoneDB database zip file.
    meta_file_path = meta_file_path,                 # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = counts_file_path,             # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object
    counts_data = 'hgnc_symbol',                     # defines the gene annotation in counts matrix.
    active_tfs_file_path = active_tf_path,           # optional: defines cell types and their active TFs.
    microenvs_file_path = microenvs_file_path,       # optional (default: None): defines cells per microenvironment.
    score_interactions = True,                       # optional: whether to score interactions or not. 
    iterations = 1000,                               # denotes the number of shufflings performed in the analysis.
    threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
    threads = 5,                                     # number of threads to use in the analysis.
    debug_seed = 42,                                 # debug randome seed. To disable >=0.
    result_precision = 3,                            # Sets the rounding for the mean values in significan_means.
    pvalue = 0.05,                                   # P-value threshold to employ for significance.
    subsampling = False,                             # To enable subsampling the data (geometri sketching).
    subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
    subsampling_num_pc = 100,                        # Number of componets to subsample via geometric skectching (dafault: 100).
    subsampling_num_cells = 1000,                    # Number of cells to subsample (integer) (default: 1/3 of the dataset).
    separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
    debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
    output_path = out_path,                          # Path to save results.
    output_suffix = None                             # Replaces the timestamp in the output files by a user defined string in the  (default: None).
    )
Reading user files...
The following user files were loaded successfully:
data/cellphoneDB/normalised_log_counts.h5ad
data/cellphoneDB/metadata.tsv
data/cellphoneDB/microenvironment.tsv
data/cellphoneDB/active_TFs.tsv
[ ][CORE][10/11/25-12:11:42][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:42 Threads:5 Precision:3
[ ][CORE][10/11/25-12:11:42][WARNING] Debug random seed enabled. Set to 42
[ ][CORE][10/11/25-12:11:42][INFO] Running Real Analysis
[ ][CORE][10/11/25-12:11:42][INFO] Limiting cluster combinations using microenvironments
[ ][CORE][10/11/25-12:11:42][INFO] Running Statistical Analysis
100%|██████████| 1000/1000 [00:25<00:00, 39.41it/s]
[ ][CORE][10/11/25-12:12:08][INFO] Building Pvalues result

[ ][CORE][10/11/25-12:12:08][INFO] Building results
[ ][CORE][10/11/25-12:12:09][INFO] Scoring interactions: Filtering genes per cell type..
100%|██████████| 42/42 [00:00<00:00, 314.53it/s]
[ ][CORE][10/11/25-12:12:10][INFO] Scoring interactions: Calculating mean expression of each gene per group/cell type..
100%|██████████| 42/42 [00:00<00:00, 920.28it/s]
[ ][CORE][10/11/25-12:12:10][INFO] Scoring interactions: Calculating scores for all interactions and cell types..
100%|██████████| 81/81 [00:16<00:00,  4.95it/s]
Saved deconvoluted to results/cellphoneDB/sig/statistical_analysis_deconvoluted_11_10_2025_121227.txt
Saved deconvoluted_percents to results/cellphoneDB/sig/statistical_analysis_deconvoluted_percents_11_10_2025_121227.txt
Saved means to results/cellphoneDB/sig/statistical_analysis_means_11_10_2025_121227.txt
Saved pvalues to results/cellphoneDB/sig/statistical_analysis_pvalues_11_10_2025_121227.txt
Saved significant_means to results/cellphoneDB/sig/statistical_analysis_significant_means_11_10_2025_121227.txt
Saved CellSign_active_interactions to results/cellphoneDB/sig/statistical_analysis_CellSign_active_interactions_11_10_2025_121227.txt
Saved CellSign_active_interactions_deconvoluted to results/cellphoneDB/sig/statistical_analysis_CellSign_active_interactions_deconvoluted_11_10_2025_121227.txt
Saved interaction_scores to results/cellphoneDB/sig/statistical_analysis_interaction_scores_11_10_2025_121227.txt
In [20]:
# help(cpdb_statistical_analysis_method.call)

Results are save into the predefined file and as a dictionary in the cpdb_results variable.

In [21]:
list(cpdb_results.keys())
Out[21]:
['deconvoluted',
 'deconvoluted_percents',
 'means',
 'pvalues',
 'significant_means',
 'CellSign_active_interactions',
 'CellSign_active_interactions_deconvoluted',
 'interaction_scores']

Description of output files¶

Most output files share common columns:

  • id_cp_interaction: Unique CellphoneDB identifier for each interaction stored in the database.
  • interacting_pair: Name of the interacting pairs separated by “|”.
  • partner A or B: Identifier for the first interacting partner (A) or the second (B). It could be: UniProt (prefix simple:) or complex (prefix complex:)
  • gene A or B: Gene identifier for the first interacting partner (A) or the second (B). The identifier will depend on the input user list.
  • secreted: True if one of the partners is secreted.
  • Receptor A or B: True if the first interacting partner (A) or the second (B) is annotated as a receptor in our database.
  • annotation_strategy: Curated if the interaction was annotated by the CellphoneDB developers. Otherwise, the name of the database where the interaction has been downloaded from.
  • is_integrin: True if one of the partners is integrin.
  • directionality: Indiicates the directionality of the interaction and the charactersitics of the interactors.
  • classification: Pathway classification for the interacting partners.

Pvalues fields:

  • cell_a|cell_b: The p-value resulting from the statistical analysis.
In [22]:
cpdb_results['pvalues'].head(2)
Out[22]:
id_cp_interaction interacting_pair partner_a partner_b gene_a gene_b secreted receptor_a receptor_b annotation_strategy ... GC|eEVT eEVT|PV MMP11 eEVT|PV MYH11 eEVT|PV STEAP4 eEVT|EVT_1 eEVT|EVT_2 eEVT|iEVT eEVT|VCT_CCC eEVT|GC eEVT|eEVT
0 CPI-SC0A2DB962D CDH1_integrin_a2b1_complex simple:P12830 complex:integrin_a2b1_complex CDH1 NaN False False False curated ... 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
1 CPI-SC0B5CEA47D COL10A1_integrin_a2b1_complex simple:Q03692 complex:integrin_a2b1_complex COL10A1 NaN True False False curated ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

2 rows × 94 columns

Means fields:

  • means: Mean values for all the interacting partners: mean value refers to the total mean of the individual partner average expression values in the corresponding interacting pairs of cell types. If one of the mean values is 0, then the total mean is set to 0.
In [23]:
cpdb_results['means'].head(2)
Out[23]:
id_cp_interaction interacting_pair partner_a partner_b gene_a gene_b secreted receptor_a receptor_b annotation_strategy ... GC|eEVT eEVT|PV MMP11 eEVT|PV MYH11 eEVT|PV STEAP4 eEVT|EVT_1 eEVT|EVT_2 eEVT|iEVT eEVT|VCT_CCC eEVT|GC eEVT|eEVT
0 CPI-SC0A2DB962D CDH1_integrin_a2b1_complex simple:P12830 complex:integrin_a2b1_complex CDH1 NaN False False False curated ... 0.433 0.232 0.0 0.0 0.198 0.225 0.215 0.203 0.225 0.472
1 CPI-SC0B5CEA47D COL10A1_integrin_a2b1_complex simple:Q03692 complex:integrin_a2b1_complex COL10A1 NaN True False False curated ... 0.000 0.049 0.0 0.0 0.015 0.043 0.032 0.020 0.042 0.289

2 rows × 94 columns

Significant means fields:

  • significant_mean: Significant mean calculation for all the interacting partners. If the interaction has been found relevant, the value will be the mean. Alternatively, the value is absent.
In [24]:
cpdb_results['significant_means'].head(2)
Out[24]:
id_cp_interaction interacting_pair partner_a partner_b gene_a gene_b secreted receptor_a receptor_b annotation_strategy ... GC|eEVT eEVT|PV MMP11 eEVT|PV MYH11 eEVT|PV STEAP4 eEVT|EVT_1 eEVT|EVT_2 eEVT|iEVT eEVT|VCT_CCC eEVT|GC eEVT|eEVT
2581 CPI-SC012DB3194 WNT10A_FZD5_LRP5 simple:Q9GZT5 complex:FZD5_LRP5 WNT10A NaN True False False curated ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2427 CPI-SC0065F5B8A WNT10A_FZD1_LRP6 simple:Q9GZT5 complex:FZD1_LRP6 WNT10A NaN True False False curated ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2 rows × 95 columns

Interaction scores fields:

  • scores: scores ranging from 0 to 100. The higher the score is, the more specific the interaction is expected to be.
In [25]:
cpdb_results['interaction_scores'].head(2)
Out[25]:
id_cp_interaction interacting_pair partner_a partner_b gene_a gene_b secreted receptor_a receptor_b annotation_strategy ... GC|eEVT eEVT|PV MMP11 eEVT|PV MYH11 eEVT|PV STEAP4 eEVT|EVT_1 eEVT|EVT_2 eEVT|iEVT eEVT|VCT_CCC eEVT|GC eEVT|eEVT
0 CPI-SC0A2DB962D CDH1_integrin_a2b1_complex simple:P12830 complex:integrin_a2b1_complex CDH1 NaN False False False curated ... 41.35 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 51.391
1 CPI-SC0B5CEA47D COL10A1_integrin_a2b1_complex simple:Q03692 complex:integrin_a2b1_complex COL10A1 NaN True False False curated ... 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000

2 rows × 94 columns

CellSign fields:

  • interacting_pair: Name of the interacting pairs.
  • partner A or B: Identifier for the first interacting partner (A) or the second (B). It could be: UniProt (prefix simple:) or complex (prefix complex:)
  • gene A or B: Gene identifier for the first interacting partner (A) or the second (B). The identifier will depend on the input user list.
  • Receptor A or B: True if the first interacting partner (A) or the second (B) is annotated as a receptor in our database.
  • TF relationship: a 1 denotes that the TF downstream the receptor is active.
In [26]:
cpdb_results['CellSign_active_interactions'].head(2)
Out[26]:
id_cp_interaction interacting_pair partner_a partner_b gene_a gene_b secreted receptor_a receptor_b annotation_strategy ... GC|eEVT eEVT|PV MMP11 eEVT|PV MYH11 eEVT|PV STEAP4 eEVT|EVT_1 eEVT|EVT_2 eEVT|iEVT eEVT|VCT_CCC eEVT|GC eEVT|eEVT
0 CPI-SS0CC966B71 BTC_EGFR simple:P35070 simple:P00533 BTC EGFR True False True curated ... 0 0 0 0 0 0 0 0 0 0
1 CPI-SS076940BCD LGALS3_MERTK simple:P17931 simple:Q12866 LGALS3 MERTK True False True curated ... 0 0 0 0 0 0 0 0 0 0

2 rows × 95 columns

Deconvoluted fields:

  • gene_name: Gene identifier for one of the subunits that are participating in the interaction defined in “means.csv” file. The identifier will depend on the input of the user list.
  • uniprot: UniProt identifier for one of the subunits that are participating in the interaction defined in “means.csv” file.
  • is_complex: True if the subunit is part of a complex. Single if it is not, complex if it is.
  • protein_name: Protein name for one of the subunits that are participating in the interaction defined in “means.csv” file.
  • complex_name: Complex name if the subunit is part of a complex. Empty if not.
  • mean: Mean expression of the corresponding gene in each cluster.
In [27]:
cpdb_results['deconvoluted'].head(2)
Out[27]:
gene_name uniprot is_complex protein_name complex_name id_cp_interaction gene B_cells DC EVT_1 ... dS1 dS2 dS3 dT_cells dT_regs eEVT fF1 fF2 iEVT uSMC
multidata_id
1355 UBASH3B Q8TF42 True UBS3B_HUMAN Dehydroepiandrosterone_bySTS CPI-CS09B8977D7 UBASH3B 0.24 0.0 0.621 ... 0.118 0.224 0.162 0.025 0.124 1.182 0.055 0.0 1.402 0.338
1355 UBASH3B Q8TF42 True UBS3B_HUMAN Dehydroepiandrosterone_bySTS CPI-CS05760BB78 UBASH3B 0.24 0.0 0.621 ... 0.118 0.224 0.162 0.025 0.124 1.182 0.055 0.0 1.402 0.338

2 rows × 49 columns

Explore CellphoneDB results¶

This module allows to filter CellphoneDB results by specifying: cell types pairs, genes or specific interactions.

We can specify two list of cell types (query_cell_type_1and query_cell_type_2), the method will perform a pairwise comparison between the cell types defined in each list and will subset the interactions to those pairs of cells. Cell types within each list will not be paired. If we are interested in filtering interactions ocurring between a given cell to all the rest of cells in the dataset we can define query_cell_type_1 = 'All' and query_cell_type_2 = ['cellA', 'cellB', ...]. The argument genes allows the user to filter interactions in which a gene participates, the users can also define specific interactions based on the interaction name query_interactions.

In this example we are going to filter interactions in which a subset of trophoblast interact with the PV cells, furthermore, we will select those interactions in which the gene TGFBR1 participates and a specific interaction named CSF1_CSF1R.

This method filters the rows and columns of the significant_means file; NaN value correspond to interacting pairs found not significant, non-NaN value correspond to those interacting pairs found relevant.

In [28]:
from cellphonedb.utils import search_utils

search_results = search_utils.search_analysis_results(
    query_cell_types_1 = ['EVT_1', 'EVT_2', 'GC', 'eEVT', 'iEVT'],  # List of cells 1, will be paired to cells 2 (list or 'All').
    query_cell_types_2 = ['PV MMP11', 'PV MYH11', 'PV STEAP4'],     # List of cells 2, will be paired to cells 1 (list or 'All').
    query_genes = ['TGFBR1'],                                       # filter interactions based on the genes participating (list).
    query_interactions = ['CSF1_CSF1R'],                            # filter intereactions based on their name (list).
    significant_means = cpdb_results['significant_means'],          # significant_means file generated by CellphoneDB.
    deconvoluted = cpdb_results['deconvoluted'],                    # devonvoluted file generated by CellphoneDB.
    interaction_scores = cpdb_results['interaction_scores'],        # interaction score generated by CellphoneDB.
    query_minimum_score = 50,                                       # minimum score that an interaction must have to be filtered.
    separator = '|',                                                # separator (default: |) employed to split cells (cellA|cellB).
    long_format = True,                                             # converts the output into a wide table, removing non-significant interactions
    query_classifications = ['Signaling by Transforming growth factor']
)

search_results.head()
Out[28]:
interacting_pair partner_a partner_b gene_a gene_b directionality classification interacting_cells significant_mean
0 GDF11_TGFR_AVR2B simple:O95390 complex:TGFR_AVR2B GDF11 NaN Ligand-Receptor Signaling by Growth differentiation factor PV MMP11|EVT_2 0.105
4 TGFB1_TGFBR3 simple:P01137 simple:Q03167 TGFB1 TGFBR3 Ligand-Receptor Signaling by Transforming growth factor PV MMP11|EVT_2 0.590
5 GDF11_TGFR_AVR2B simple:O95390 complex:TGFR_AVR2B GDF11 NaN Ligand-Receptor Signaling by Growth differentiation factor PV MMP11|iEVT 0.105
8 TGFB1_TGFbeta_receptor1 simple:P01137 complex:TGFbeta_receptor1 TGFB1 NaN Ligand-Receptor Signaling by Transforming growth factor PV MMP11|iEVT 0.446
9 TGFB1_TGFBR3 simple:P01137 simple:Q03167 TGFB1 TGFBR3 Ligand-Receptor Signaling by Transforming growth factor PV MMP11|iEVT 0.771

4. Basic analysis and plotting¶

In this section of the tutorial we will make use of Kelvin's implementations (kt-plots tutorial) to perform some exploratory analysis of the results obtained from CellphoneDB. It is of great importance that the user is familiarized with the biological context of the data to obtain meaninful results from CellphoneDB results.

In [29]:
import os
import anndata as ad
import pandas as pd
import ktplotspy as kpy
import matplotlib.pyplot as plt
%matplotlib inline

Display as heatmap the number of significant interactions found by CellphoneDB between all of the cell pairs. Only cells found in the microenvironment will be represented in this plot.

In [30]:
kpy.plot_cpdb_heatmap(pvals = cpdb_results['pvalues'],
                      degs_analysis = False,
                      figsize = (5, 5),
                      title = "Sum of significant interactions")
Out[30]:
<seaborn.matrix.ClusterGrid at 0x7fede08651d0>

Here we are plotting the interactions between the PVs and the trophoblasts that are mediated by TGFB2 and CSF1R.

In [31]:
kpy.plot_cpdb(
    adata = adata,
    cell_type1 = "PV MYH11|PV STEAP4|PV MMPP11",
    cell_type2 = "EVT_1|EVT_2|GC|iEVT|eEVT|VCT_CCC",
    means = cpdb_results['means'],
    pvals = cpdb_results['pvalues'],
    celltype_key = "cell_labels",
    genes = ["TGFB2", "CSF1R"],
    figsize = (10, 3),
    title = "Interactions between\nPV and trophoblast",
    max_size = 3,
    highlight_size = 0.75,
    degs_analysis = False,
    standard_scale = True,
    interaction_scores = cpdb_results['interaction_scores'],
    scale_alpha_by_interaction_scores = True
)
Out[31]:

Interactions can also be plotted grouped by pathway.

In [32]:
from plotnine import facet_wrap

p = kpy.plot_cpdb(
    adata = adata,
    cell_type1 = "PV MYH11",
    cell_type2 = "EVT_1|EVT_2|GC|iEVT|eEVT|VCT_CCC",
    means = cpdb_results['means'],
    pvals = cpdb_results['pvalues'],
    celltype_key = "cell_labels",
    genes = ["TGFB2", "CSF1R", "COL1A1"],
    figsize = (12, 8),
    title = "Interactions between PV and trophoblast\ns grouped by classification",
    max_size = 6,
    highlight_size = 0.75,
    degs_analysis = False,
    standard_scale = True,
)
p + facet_wrap("~ classification", ncol = 1)
Out[32]:
In [33]:
import sys
import pkg_resources

print("### Session Info ###\n")

loaded_pkgs = []

for mod_name, module in sys.modules.items():
    file_path = getattr(module, "__file__", None)
    if (
        file_path
        and "site-packages" in file_path
        and not mod_name.startswith("_")
        and "." not in mod_name
    ):
        try:
            version = pkg_resources.get_distribution(mod_name).version
            loaded_pkgs.append((mod_name, version))
        except Exception:
            pass

for name, version in sorted(set(loaded_pkgs)):
    print(f"{name:<15} v{version}")
/tmp/ipykernel_101942/219190044.py:2: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
### Session Info ###

IPython         v8.20.0
anndata         v0.10.5.post1
asciitree       v0.3.3
asttokens       v2.4.1
brotli          v1.1.0
cellphonedb     v5.0.1
certifi         v2025.7.14
cffi            v1.16.0
chardet         v5.2.0
charset_normalizer v3.3.2
cloudpickle     v3.0.0
colorama        v0.4.6
comm            v0.2.1
cycler          v0.12.1
cytoolz         v0.12.2
dask            v2024.1.1
debugpy         v1.8.0
decorator       v5.1.1
executing       v2.0.1
fastcluster     v1.2.6
fasteners       v0.17.3
fbpca           v1.0
geosketch       v1.3
h5py            v3.12.1
idna            v3.6
importlib_metadata v7.0.1
ipykernel       v6.29.0
jedi            v0.19.1
jinja2          v3.1.3
joblib          v1.3.2
jupyter_client  v8.6.0
jupyter_core    v5.7.1
kiwisolver      v1.4.5
ktplotspy       v0.3.3
llvmlite        v0.43.0
lz4             v4.3.3
markupsafe      v2.1.4
matplotlib      v3.8.2
matplotlib_inline v0.1.6
mizani          v0.14.3
msgpack         v1.0.7
natsort         v8.4.0
numba           v0.60.0
numcodecs       v0.12.1
numpy           v1.26.4
numpy_groupies  v0.11.3
packaging       v23.2
pandas          v2.2.0
parso           v0.8.3
patsy           v0.5.6
pexpect         v4.9.0
pickleshare     v0.7.5
platformdirs    v4.1.0
plotnine        v0.15.1
prompt_toolkit  v3.0.42
psutil          v5.9.8
ptyprocess      v0.7.0
pure_eval       v0.2.2
pyarrow         v17.0.0
pycirclize      v1.10.1
pycparser       v2.21
pygments        v2.17.2
pyparsing       v3.1.1
pytz            v2023.3.post1
requests        v2.31.0
scipy           v1.12.0
seaborn         v0.13.2
six             v1.16.0
stack_data      v0.6.2
statsmodels     v0.14.5
tabulate        v0.9.0
tblib           v3.0.0
threadpoolctl   v3.2.0
toolz           v0.12.1
tornado         v6.3.3
tqdm            v4.66.1
traitlets       v5.14.1
typing_extensions v4.14.1
urllib3         v2.5.0
wcwidth         v0.2.13
zarr            v2.16.1
zipp            v3.17.0

Additional viz¶

For chord plot, refer to:
https://github.com/moshi4/pyCirclize
https://jokergoo.github.io/circlize/